OCR binarization and image pre-processing for searching historical documents
Internal identifier: 000F29 (Main/Exploration); previous: 000F28; next: 000F30
Authors: Maya R. Gupta [United States]; Nathaniel P. Jacobson [United States]; Eric K. Garcia [United States]
Source:
- Pattern recognition [0031-3203]; 2007.
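French descriptors
- Pascal (Inist): Reconnaissance optique caractère, Image binaire, Mot clé, Document imprimé, Filtrage, Réduction bruit, Implémentation, Diffusion d'erreur, Analyse multirésolution, Evaluation performance, Tramage, Reconnaissance forme, Traitement signal, Traitement image, Déchatoiement
English descriptors
- KwdEn: Binary image, Despeckling, Dithering, Error diffusion, Filtering, Image processing, Implementation, Keyword, Multiresolution analysis, Noise reduction, Optical character recognition, Pattern recognition, Performance evaluation, Printed document, Signal processing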
Abstract
We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.
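For readers unfamiliar with the classic Otsu method named in the abstract, the following is a minimal sketch of global Otsu thresholding: it picks the gray level that maximizes the between-class variance of the image histogram. It is not the authors' implementation and does not cover their multiresolution variant. The function names `otsu_threshold` and `binarize` are illustrative, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t maximizing between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    Assumes integer pixel values in range(levels).
    """
    # Build the gray-level histogram.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_low = 0        # number of pixels with value <= t
    sum_low = 0      # sum of gray values of those pixels
    best_t, best_var = 0, -1.0

    for t in range(levels):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # lower class still empty
        w_high = total - w_low
        if w_high == 0:
            break             # upper class empty from here on
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance, up to a constant factor 1/total**2.
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 if it is brighter than the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]
```

A typical use would be `t = otsu_threshold(pixels)` followed by `binarize(pixels, t)`. Whether dark text maps to 0 or 1 is a convention chosen here for illustration only.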
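Affiliations: Electrical Engineering, University of Washington, Seattle, Washington 98195, USA (all three authors)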
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000355
- to stream PascalFrancis, to step Curation: 000431
- to stream PascalFrancis, to step Checkpoint: 000271
- to stream Main, to step Merge: 000F42
- to stream Main, to step Curation: 000F29
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">OCR binarization and image pre-processing for searching historical documents</title>
<author><name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author><name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author><name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">07-0059470</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 07-0059470 INIST</idno>
<idno type="RBID">Pascal:07-0059470</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000355</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000431</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000271</idno>
<idno type="wicri:doubleKey">0031-3203:2007:Gupta M:ocr:binarization:and</idno>
<idno type="wicri:Area/Main/Merge">000F42</idno>
<idno type="wicri:Area/Main/Curation">000F29</idno>
<idno type="wicri:Area/Main/Exploration">000F29</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">OCR binarization and image pre-processing for searching historical documents</title>
<author><name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author><name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author><name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
<imprint><date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Binary image</term>
<term>Despeckling</term>
<term>Dithering</term>
<term>Error diffusion</term>
<term>Filtering</term>
<term>Image processing</term>
<term>Implementation</term>
<term>Keyword</term>
<term>Multiresolution analysis</term>
<term>Noise reduction</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Performance evaluation</term>
<term>Printed document</term>
<term>Signal processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Image binaire</term>
<term>Mot clé</term>
<term>Document imprimé</term>
<term>Filtrage</term>
<term>Réduction bruit</term>
<term>Implémentation</term>
<term>Diffusion d'erreur</term>
<term>Analyse multirésolution</term>
<term>Evaluation performance</term>
<term>Tramage</term>
<term>Reconnaissance forme</term>
<term>Traitement signal</term>
<term>Traitement image</term>
<term>Déchatoiement</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Washington (État)</li>
</region>
<settlement><li>Seattle</li>
</settlement>
<orgName><li>Université de Washington</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Washington (État)"><name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
</region>
<name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F29 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F29 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:07-0059470 |texte= OCR binarization and image pre-processing for searching historical documents }}
This area was generated with Dilib version V0.6.32.